Automatic Textual Document Categorization Using Multiple Similarity-Based Models
Authors
Abstract
We develop a similarity-based textual document categorization method called the generalized instance set (GIS) algorithm. GIS integrates the advantages of linear classifiers and the k-nearest neighbour algorithm by generalizing selected instances. To further enhance performance, we propose a meta-model framework that combines the strengths of different variants of the GIS algorithm, as well as existing state-of-the-art algorithms, using multivariate regression analysis on document feature characteristics. Document feature characteristics, derived from the training document set, capture some inherent properties of a particular category. Unlike existing categorization methods, our proposed meta-model can automatically recommend a suitable algorithm for each category based on category-specific statistical characteristics. In addition, our meta-model differs from existing multi-strategy learning in that it is not limited in the number or type of component classifiers. By flexibly adding and substituting classifiers, incremental improvements in classification performance can be obtained. Extensive experiments have been conducted. The results confirm that our meta-model approach can exploit the advantages of its component algorithms and performs better than existing algorithms.
∗Corresponding Author: Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, Hong Kong.
†Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, Hong Kong.
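The abstract does not spell out how the meta-model maps category characteristics to an algorithm recommendation. A minimal sketch of one plausible reading follows: fit one least-squares regression per component classifier that predicts its performance from category-level features, then recommend the classifier with the highest predicted score. The function names, the two features, the classifier labels, and all numbers are hypothetical toy values for illustration, not taken from the paper.

```python
import numpy as np

def fit_meta_model(features, scores):
    """Fit one linear regression per classifier (assumed formulation).

    features: (n_categories, n_features) array of category characteristics.
    scores:   dict mapping classifier name -> per-category performance.
    Returns a dict mapping classifier name -> weight vector (intercept first).
    """
    X = np.column_stack([np.ones(len(features)), features])  # prepend intercept
    return {
        name: np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)[0]
        for name, y in scores.items()
    }

def recommend(models, category_features):
    """Return the classifier with the highest predicted score, plus all predictions."""
    x = np.concatenate([[1.0], category_features])
    predicted = {name: float(x @ w) for name, w in models.items()}
    return max(predicted, key=predicted.get), predicted

# Toy data (made up): column 0 = number of training documents in the category,
# column 1 = a second, arbitrary category statistic.
features = np.array([
    [100.0, 0.3],
    [200.0, 0.1],
    [400.0, 0.5],
    [800.0, 0.2],
    [1000.0, 0.4],
])
# Hypothetical per-category scores for two placeholder classifiers.
scores = {
    "classifier_a": [0.85, 0.80, 0.70, 0.50, 0.40],  # better on small categories
    "classifier_b": [0.55, 0.60, 0.70, 0.90, 1.00],  # better on large categories
}

models = fit_meta_model(features, scores)
small_choice, small_pred = recommend(models, np.array([100.0, 0.3]))
large_choice, large_pred = recommend(models, np.array([900.0, 0.3]))
print(small_choice, large_choice)
```

Because each classifier gets its own regression, adding or replacing a component classifier only requires fitting one more model, which is in line with the flexibility the abstract describes.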
Similar resources
A survey on Automatic Text Summarization
Text summarization endeavors to produce a summary version of a text, while maintaining the original ideas. The textual content on the web, in particular, is growing at an exponential rate. The ability to decipher through such massive amount of data, in order to extract the useful information, is a major undertaking and requires an automatic mechanism to aid with the extant repository of informa...
TopicWeb: A Novel Approach to Automatic Document Similarity Measurement and Categorization
The project introduces the TopicWeb, a system for improved similarity measurement between two textual documents (news articles, emails, etc). It consists of an interconnected web of documents, where a connection between two documents specifies the strength of the similarity between the topics of those documents. A standard method of calculating similarity between two documents is improved upon ...
An Effective Sentence Ordering Approach For Multi-Document Summarization Using Text Entailment
With the rapid development of modern technology electronically available textual information has increased to a considerable amount. Summarization of textual information manually from unstructured text sources creates overhead to the user, therefore a systematic approach is required. Summarization is an approach that focuses on providing the user with a condensed version of the original text bu...
Similarity Model and Term Association for Document Categorization
This paper addresses similarity model and term association for similarity-based document categorization. Both Euclidean distance– and cosine-based similarity models are widely used for measures of document similarity in information retrieval and document categorization community. These two similarity models are based on the assumption that term vectors are orthogonal. Term associations are igno...
Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation
Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...
Publication date: 2001